Add RFC2047-compliant MIME text decoder #9313
Draft
Trac ticket: Core-63864
Status
Please feel free to ignore this for now.
Description
The existing `wp_iso_descrambler()` was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:
But even so, it’s only likely to truly work with `US-ASCII`, which is rare to find in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (`latin1`) as their default, but today UTF-8 is the predominant encoding, and because the function returns the bytes exactly as they were encoded, it fails to perform its main function, which is to translate non-ASCII encodings.
The above image illustrates how the bytes print as an invalid UTF-8 sequence in `trunk` after decoding. The 0x80 byte was chosen for this demonstration because in `latin1` it’s a control character, in `cp1252` and in HTML it’s remapped to the Euro sign, and in UTF-8 it’s an invalid sequence.
Without additional conversion, calling code has to know which encoding the running PHP system uses and what other code will perform re-encoding. It’s likely to mess up. Worse, if the encoding is not `ISO-8859-1` (`latin1`) then the decoding is wrong for all character sets.
This patch implements a compliant RFC2047 MIME text decoder, and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants without needing to make any assumptions or inquire about what it gets.
With the same input as above we can see that the default output is now converted from the indicated input encoding. In this example, that decodes to a control character in UTF-8 but that is authentic to the given input. The re-encodings are now invalid because the returned data is already in UTF-8.
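To make the 0x80 case concrete, here is a small sketch of what happens to that byte under each interpretation. The encoded word is a made-up input rather than the one from the screenshots, and the snippet assumes the `mbstring` extension is loaded.

```php
<?php
// Hypothetical input: =?ISO-8859-1?Q?=80?= carries the single payload "=80".
// Undoing only the Q-encoding yields the raw byte 0x80.
$raw_byte = quoted_printable_decode( '=80' );

// Returned as-is, the byte is not a valid UTF-8 sequence.
var_dump( mb_check_encoding( $raw_byte, 'UTF-8' ) ); // bool(false)

// Converted from the labeled charset, it becomes U+0080 (a control character).
$utf8 = mb_convert_encoding( $raw_byte, 'UTF-8', 'ISO-8859-1' );
echo bin2hex( $utf8 ), "\n"; // c280

// Interpreted as cp1252 instead, the same byte is the Euro sign U+20AC.
$euro = mb_convert_encoding( $raw_byte, 'UTF-8', 'Windows-1252' );
echo bin2hex( $euro ), "\n"; // e282ac
```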
Supported encodings
This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.
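A minimal sketch of that availability-based fallback, trying `mb_convert_encoding()`, then `iconv()`, then a direct pass-through for US-ASCII or UTF-8 bytes. The helper name `example_to_utf8()` and its `null`-on-failure return are assumptions for illustration, not the code in this patch.

```php
<?php
/**
 * Hypothetical helper: converts bytes in a named charset to UTF-8,
 * returning null when no available converter can handle the charset.
 */
function example_to_utf8( string $bytes, string $charset ): ?string {
	// Preferred: mbstring, when it is loaded and recognizes the charset.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
			if ( is_string( $converted ) ) {
				return $converted;
			}
		} catch ( ValueError $e ) {
			// PHP 8+ throws for unknown encodings; fall through to iconv.
		}
	}

	// Next: iconv, which returns false on failure.
	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( false !== $converted ) {
			return $converted;
		}
	}

	// Last resort: bytes that are already US-ASCII or UTF-8 need no conversion.
	$name = strtoupper( $charset );
	if ( 'US-ASCII' === $name && ! preg_match( '/[\x80-\xFF]/', $bytes ) ) {
		return $bytes;
	}
	if ( 'UTF-8' === $name && 1 === preg_match( '//u', $bytes ) ) {
		return $bytes;
	}

	return null;
}
```

For example, `example_to_utf8( "\xE9", 'ISO-8859-1' )` yields the two-byte UTF-8 sequence for "é" when either extension is present.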
If `mb_convert_encoding()` is available it will be preferred, followed by `iconv()`, followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and not addressed in this function.
Error handling
Unfortunately, even where `iconv_mime_decode()` is available, its error-handling options are limited and unclear. By implementing the decoder in user-space the error cases can be explicitly handled, and this implementation provides configurable error handling:
- `preserve-errors` leaves the invalid encoded word in place. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
- `replace-errors` will remove the entire encoded word and replace it with the replacement character U+FFFD �. This discards information from the input, but leaves a placemarker indicating that it was there before.
- `bail-on-error` will cause the function to return early and return `null`, effectively the same as the `strict` mode in other decoders.
There are multiple classes of potential errors and error behavior is not defined in the RFC. This implementation treats all classes in the same way, except for the rule that encoded words must be 75 characters or shorter (as this rule was clearly intended for encoders to make the job of decoding simpler, but otherwise does not speak to the well-formedness of the encoding).
- An unsupported encoding (only `B` and `Q` are supported).
- An invalid Q-encoded escape such as `=.` or `=6f` (only upper-case hex digits are allowed).
Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text, and the parser will skip over it to look for the next well-formed encoded word.
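The three modes can be sketched roughly as follows. This is an illustrative toy, not the decoder in this patch: the function name `example_decode_words()` and its signature are hypothetical, it only accepts encoded words labeled UTF-8, and it ignores details such as whitespace between adjacent encoded words. Text that does not match the encoded-word shape is never touched, which mirrors the "skip over it as plain text" behavior.

```php
<?php
/**
 * Hypothetical sketch: decodes UTF-8 encoded words and applies one of
 * three error modes to encoded words that are well-formed but invalid.
 */
function example_decode_words( string $text, string $on_error = 'preserve-errors' ): ?string {
	$failed = false;

	$decoded = preg_replace_callback(
		// =?charset?encoding?payload?= ; anything else is left as plain text.
		'/=\?([^?\s]+)\?([^?\s]+)\?([^?\s]+)\?=/',
		static function ( array $m ) use ( $on_error, &$failed ) {
			list( $word, $charset, $encoding, $payload ) = $m;
			$encoding = strtoupper( $encoding );
			$bytes    = null;

			if ( 'B' === $encoding ) {
				$result = base64_decode( $payload, true );
				$bytes  = false === $result ? null : $result;
			} elseif ( 'Q' === $encoding && ! preg_match( '/=(?![0-9A-F]{2})/', $payload ) ) {
				// Every "=" must start an upper-case hex escape; "_" means space.
				$bytes = quoted_printable_decode( str_replace( '_', ' ', $payload ) );
			}

			$is_utf8 = 'UTF-8' === strtoupper( $charset );
			if ( $is_utf8 && null !== $bytes && 1 === preg_match( '//u', $bytes ) ) {
				return $bytes;
			}

			// All error classes are treated the same way.
			$failed = true;
			return 'replace-errors' === $on_error ? "\u{FFFD}" : $word;
		},
		$text
	);

	return ( $failed && 'bail-on-error' === $on_error ) ? null : $decoded;
}
```

With the input `a =?UTF-8?Q?=6f?= b`, the default mode returns the text unchanged, `replace-errors` returns `a � b`, and `bail-on-error` returns `null`.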
Notes